Italian-Arabic linguistic tools
Abstract
This paper concerns our participation in the research project “Corpus bilingue Italiano – Arabo” (Bilingual Italian – Arabic corpus), funded by law 488/92. The purpose of this project is to develop linguistic tools and resources for bilingual Italian/Arabic corpora; its background and starting point are tools that have already been developed by the Computational Linguistics Institute. As far as IT tools are concerned, the project consists of four basic elements: a) a morphological engine for the Arabic language; b) an aligning system for Italian and Arabic parallel texts; c) an automatic tagging system for Italian and Arabic texts; d) access tools (and the relevant query systems) for the texts of the bilingual corpora at each text-processing step.
Introduction
In the framework of the comprehensive “Linguistica Computazionale: ricerche monolingui e multilingui” (Computational Linguistics: monolingual and multilingual research) project funded by law 488/1999, the Istituto di Linguistica Computazionale has taken part in the study and development of tools and resources for the Arabic language, as part of the “Corpus bilingue Italiano – Arabo” (Italian – Arabic bilingual corpus) objective. This objective involves the development of a bilingual linguistic work environment, consisting of Italian and Arabic tools and resources, with special attention to its contrastive aspect. Bilingual corpora are innovative research tools that work by comparing the relevant languages and/or cultures; they are essential for developing computer-assisted teaching methods and for acquiring most of the knowledge on which the most promising multilingual IT applications are based (translation aids, information retrieval, data mining, etc.).
The objective has been developed in co-operation with the Istituto Universitario Orientale of Naples and the “Dipartimento di Scienze Storiche del Mondo Antico” of Pisa University, which took care of its linguistic aspects, while we developed all its software features.
Linguistic tools:
- Textual analysis procedures
- Morphological engines
- Taggers
- Aligner
Linguistic resources:
- Monolingual reference corpora
- Automatic lexicons
- Bilingual aligned corpora
- Tagged corpora
As a background contribution, the Istituto di Linguistica Computazionale provided the PiSystem, an integrated linguistic analysis system developed by Eugenio Picchi, which has become the standard for many projects based on the study and analysis of different types of texts. Its basic engine is the DBT (Data Base Testuale – Textual Data Base) system for the analysis and use of textual resources. The PiSystem features used in the project were its existing Italian modules: PiMorfo (Italian morphological engine), PiTagger (automatic Italian morpho-syntactic disambiguator) and Synchro (a procedure for the automatic “synchronisation” of parallel texts, already used in Italian-English and Italian-Latin bilingual applications). In addition, these tools were the basis for developing the corresponding features of the Italian-Arabic bilingual system. The project as a whole involves the development of the following linguistic resources:
- generic corpus (8 million words)
- aligned parallel corpus (4 million words)
- tagged corpus (2 million words)
- morphological lexical resources (20,000 entries)
The Arabic textual analysis system and relevant “query system”
The 8-bit (256-code) encoding system provided by the ISO 8859-6 (Arabic) charset has been used throughout the project, for potential interchange with other partners, the acquisition of existing texts and materials, and the development of software tools.
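As a rough illustration of what a single-byte ISO 8859-6 representation implies, the sketch below uses Python's built-in `iso8859_6` codec. The helper names and the sample word are assumptions made for this example; they are not part of the project's software.

```python
# Minimal sketch, assuming Python's standard "iso8859_6" codec.
# Function names and sample strings are hypothetical.

def to_iso8859_6(text: str) -> bytes:
    """Encode text in ISO 8859-6: every character becomes exactly one byte."""
    return text.encode("iso8859_6")

def from_iso8859_6(data: bytes) -> str:
    """Decode ISO 8859-6 bytes back into a string."""
    return data.decode("iso8859_6")

word = "كتاب"  # "book", an arbitrary sample word
encoded = to_iso8859_6(word)

# One byte per letter, all in the upper half of the 8-bit table.
assert len(encoded) == len(word)
assert all(b >= 0xA0 for b in encoded)

# The lower half is plain ASCII, so unaccented Latin text can share the file.
mixed = to_iso8859_6("libro = " + word)
print(mixed.hex(" "))

# Round trip: decoding restores the original string.
assert from_iso8859_6(encoded) == word
```

Note that the charset stores only the abstract letters; the positional shapes (initial, middle, final, isolated) are not encoded and have to be chosen by the display layer.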
The Arabic alphabet is composed of 28 letters, which are shaped differently depending on their position (initial, middle, final or isolated), since these letters have to be linked to each other (except for a group of six letters) to make words. The decision to adopt a single encoding system, both for the acquisition and entry of linguistic materials and for internal representation and processing, was extremely important. Given the bilingual nature of the project, and so that the materials and tools could be used regardless of the availability of native Arabic computers and operating systems, the strategy chosen was to develop a proprietary system for interacting with Arabic materials, i.e. a system that can be used interactively through the keyboard and that gives a correct representation even without a specialised Arabic computer or operating system (the development environment is Windows). The keys on the keyboard have been mapped to the Arabic alphabet so that the layout matches a standard Arabic keyboard (Fig. 1). Each program was provided with a double function: the above-mentioned keyboard mapping for normal typing, and a virtual keyboard operated with the mouse to compose a text, queries in particular.
The DBT (Data Base Testuale – Textual Data Base) system was the basic tool used in the Arabic language project. Although this system was already equipped to manage a whole series of non-Latin alphabets, it required substantial changes in order to work properly on Arabic texts. It can display all or part of the text, search for words, calculate frequencies, define search functions with several words associated in different ways using logic operators, retrieve all the contexts that fulfil specific search conditions, generate sorted concordances, define specific conditions for concordance generation, search by regular phrases, etc.
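The query functions listed above (word frequencies, sorted concordances, combined word searches) can be sketched in a few lines. This is only an illustrative approximation under simple assumptions (whitespace tokenisation, exact word matching); it is not the DBT implementation, and every function name here is hypothetical.

```python
from collections import Counter

def tokenize(text):
    """Whitespace split; a real system would also handle punctuation and clitics."""
    return text.split()

def frequencies(tokens):
    """Count how often each word form occurs."""
    return Counter(tokens)

def concordance(tokens, keyword, width=1):
    """Return (left, keyword, right) contexts, sorted by the right context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return sorted(lines, key=lambda line: line[2])

def near(tokens, first, second, distance=1):
    """AND-style query: positions of `first` with `second` within `distance` tokens."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == first and second in tokens[max(0, i - distance):i + distance + 1]:
            hits.append(i)
    return hits

# Hypothetical sample sentence, not taken from the project corpus.
tokens = tokenize("il corpus bilingue allinea il testo italiano e il testo arabo")
print(frequencies(tokens).most_common(2))
print(concordance(tokens, "testo"))
print(near(tokens, "testo", "arabo"))
```

The same functions work unchanged on Arabic tokens, since they compare strings in logical order; right-to-left layout only matters when the contexts are displayed.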
The Arabic-alphabet version of DBT preserves the characteristics of the language (such as displaying text from right to left) and is instructed, through special descriptive tables, on how to read the input text encoding, both for proper display on screen and in print and for determining the proper alphabetic order. These resources have been designed to comply with the ISO 8859-6 standard.
Figure 1: data-entry keyboard
Figure 2: working session using Arabic DBT query system
Morphological engine
The morphological engine has been designed to perform a double function. On the one hand, it generates the inflexion: from one Arabic entry, it automatically generates all its forms (including their morpho-syntactic classification). On the other hand, it allows morphological analysis, which goes back from one form to the entry (or entries) to which the form belongs and identifies its potential, theoretically valid, morpho-syntactic classifications. To develop this component, we had to:
1. Define the encoding system to be used for the representation of lexical data; define the composition, size and structure of the Lemmario (dictionary of entries); define the encoding system, syntax and structure of the “morphological rules” file;
2. Identify groups of entries having the same morphological behaviour and draw up morphological rules based on the defined encoding and syntax;
3. Develop a “Lemmario” file and enter suitable inflexion codes in it;
4. Develop software modules for the development and management of supporting files (lemmario and inflexion rules);
5. Develop software modules for generation and